“Provide background on your data sets and a clear formulated question or hypothesis.”
For this project, my question of Interest is: “How does crime rate relate to poverty in Canada?”
In order to answer this question, data on both crime and socioeconomic status are needed. However, I found no existing data set that contains all desired information, therefore this needs to be achieved through merging more than one data sets. Aftering choosing carefully, the following two separate data sets are obtained:
“Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas”. Released 2023-05-02. This data set is annually updated and maintained by Statistics Canada (Table 11-10-0239-01). Data is collected through the Survey of Labor and Income Dynamics, Survey of Consumer Finances, and Canadian Income Survey.
“Incident-based crime statistics, by detailed violations, Canada, provinces, territories, Census Metropolitan Areas and Canadian Forces Military Police”. Released 2023-07-27. This data set is also annually updated and maintained by Statistics Canada (Table 35-10-0177-01, formerly CANSIM 252-0051). Data is collected through the Uniform Crime Reporting Survey.
Understanding the relationship between crime rates and poverty in Canada is crucial for policymakers, law enforcement agencies, and social welfare programs. Exploring this correlation can shed light on the socioeconomic factors driving criminal behavior and help formulate targeted interventions to alleviate poverty and reduce crime. Additionally, elucidating this connection can inform broader discussions on social inequality, justice, and community well-being in Canadian society.
“Include how and where the data were acquired, how you cleaned and wrangled the data, what tools you used for data exploration.”
Both data sets are downloaded directly from Statistics
Canada, which is usually considered to be an reliable source.
Because they share the same source, the data sets follows similar
structure and all contains the two columns GEO and
REF_DATE where the former one refers to the geographical
region and the second one refers to the year of data. Thus, it’s
possible to combine the two data sets to obtain all information
needed.
However, it is worth mentioning that both data sets are huge and contains unrelated information. Therefore, cleaning and wrangling are needed for more convenient analysis and more efficient computing & uploading, as the original data sets are oversize thus cannot be pushed to github repository.
Reference:
fn1 <- "https://raw.githubusercontent.com/inorrr/JSC370_project/main/census.csv"
fn2 <- "https://raw.githubusercontent.com/inorrr/JSC370_project/main/crime.csv"
if (!file.exists("census.csv"))
download.file(fn1, destfile = "census.csv")
census_df <- data.table::fread("census.csv")
if (!file.exists("crime.csv"))
download.file(fn2, destfile = "crime.csv")
crime_df <- data.table::fread("crime.csv")
crime_df <- crime_df[, c("REF_DATE", "GEO", "Violations", "Statistics", "VALUE", "UOM")]
census_df <- census_df[, c("REF_DATE", "GEO", "Age group", "Sex", "Income source", "Statistics", "VALUE", "UOM", "SCALAR_FACTOR")]
table(census_df$GEO)
provinces1 <- c("Alberta [48]", "British Columbia [59]", "Manitoba [46]", "New Brunswick [13]",
"Newfoundland and Labrador [10]", "Saskatchewan [47]",
"Nova Scotia [12]", "Ontario [35]",
"Prince Edward Island [11]", "Quebec [24]")
provinces2 <- c("Alberta", "British Columbia", "Manitoba", "New Brunswick",
"Newfoundland and Labrador", "Saskatchewan","Nova Scotia",
"Ontario", "Prince Edward Island", "Quebec")
crime_df <- crime_df[crime_df$GEO %in% provinces1, ]
census_df <- census_df[census_df$GEO %in% provinces2, ]
crime_df$GEO <- gsub("\\s*\\[\\d+\\]$", "", crime_df$GEO)
table(crime_df$GEO)
Actual incidents) and crime rate
(Rate per 100,000 population), thus statistics related to
charges are removed. The Crime Severity
Index(Percentage contribution to the Crime Severity Index (CSI))
seems to be interesting and thus is kept.crime_df <- crime_df %>% filter(Statistics == "Actual incidents" |
Statistics == "Rate per 100,000 population" |
Statistics == "Percentage contribution to the Crime Severity Index (CSI)")
print(length(unique(crime_df$Violations)))
table(crime_df$Violations)
# Identify rows that start with "Total"
total_rows <- grepl("^Total", crime_df$Violations)
# Subset the dataframe to keep only the rows starting with "Total"
crime_df <- crime_df[total_rows, , drop = FALSE]
# Remove square brackets and numbers at the end
crime_df$Violations <- gsub("\\s*\\[\\d+\\]$", "", crime_df$Violations)
the column Age group specifies the age however since we do not have this information in the crime df, we need to combine all age groups. This can be done by taking the average of the categories.
Same for sex, same method is used.
table(census_df$"Age group")
table(census_df$"Sex")
# first we merge the age group categories
census_df <- census_df %>%
group_by(REF_DATE, GEO, Sex, `Income source`, Statistics, UOM, SCALAR_FACTOR) %>%
summarise(VALUE = mean(VALUE, na.rm = TRUE))
# next we merge the age group categories
census_df <- census_df_new %>%
group_by(REF_DATE, GEO, `Income source`, Statistics, UOM, SCALAR_FACTOR) %>%
summarise(VALUE = mean(VALUE, na.rm = TRUE))
crime_df <- crime_df %>% filter(REF_DATE >= 1998 & REF_DATE <= 2021)
census_df <- census_df %>% filter(REF_DATE >= 1998 & REF_DATE <= 2021)
write.csv(crime_df, "/Users/yinuozhao/Desktop/UofT/JSC370/JSC370-2024-main/JSC370_project/crime.csv")
write.csv(census_df, "/Users/yinuozhao/Desktop/UofT/JSC370/JSC370-2024-main/JSC370_project/census.csv")
At this point both the crime data frame and census data frame has
REF_DATE and GEO in common, and they each have
another categorical variable, which is
Income source for census data and Violation(it
means crime type) for crime data. While it may seem to make sense to
join the two data sets using REF_DATE and GEO directly, the results
would involves the data for all combinations of Income source and
Violation for each REF_DATE and GEO. This will be a huge data set and
thus slow down the computation. Therefore, I choose to keep the
data sets separate and join them when
necessary (i.e. after picking out certain categories of
interest).
Notice that right now both data sets are in long format, I converted them to wide for convenience.
crime_df <- pivot_wider(crime_df, id_cols = c(REF_DATE, GEO, Violations),
names_from = Statistics, values_from = VALUE)
crime_df <- na.omit(crime_df)
census_df <- pivot_wider(census_df, id_cols = c(REF_DATE, GEO, `Income source`),
names_from = Statistics, values_from = VALUE)
census_df <- na.omit(census_df)
Check the dimensions and headers and footers of the data
dim(census_df)
dim(crime_df)
head(crime_df)
tail(crime_df)
head(census_df)
tail(census_df)
The census data set has 8 variables with 3613 observations, the crime dataset has 6 variables with 9271 observations. By looking at the headers and footers of both data sets, they seems to be imported correctly and contains no missing values (in the displayed rows).
Check the variable types in the data
str(census_df)
str(crime_df)
summary(census_df)
summary(crime_df)
In both data frames, we see that the variable types are a mix of integer, numeric and characters. All variable types correctly align with the context of the variables. No major problems arises with the data at this stage (i.e. a variable with all missing values.)
Take a closer look at some/all of the variables
For both data frame, we need REF_DATE and
GEO to correctly identify a province in Canada with a valid
year. For census data frame, we need to look at the values of the
different types of income (median, aggregate, etc). For the crime data
frame, we need to look at the recorded crime rate and actual number of
incidents to be within the reasonable range.
table(census_df$REF_DATE)
table(census_df$GEO)
table(crime_df$REF_DATE)
table(crime_df$GEO)
summary(census_df$`Aggregate income`)
summary(census_df$`Average income (excluding zeros)`)
summary(census_df$`Median income (excluding zeros)`)
summary(crime_df$`Actual incidents`)
summary(crime_df$`Rate per 100,000 population`)
Both data sets contains data from 1998 to 2021, on the 10 provinces in Canada as desired because I cleaned the data sets this way. Other variables being checked are within the reasonable range. The aggregate income, average income and median income are all measured in 2021 constant dollars, aggregate income record numbers in millions. The crime rates are measured as number of incidents per 100,000 population.
Validate with an external source
Notice that the minimum average income is 677.8, which seems to be much lower than then mean average income, even 10 times lower than the 1st quantile. Since it seems quite suspisous, we need to validate it.
census_df[which.min(census_df$`Average income (excluding zeros)`), ]
This data is from Prince Edward Island in 2004, and the income source is “other government transfers”. Upon research, Government transfers refers to assistance from provincial and municipal programs, Workers’ Compensation benefits, the GST/HST Credit and provincial refundable tax credits such as the Quebec and Newfoundland and Labrador sales tax credits. However, since many of the above mentioned are made to their own category and excluded from “other government transfers” in the data set, it make sense that the value is low.
Provide summary statistics in tabular from and publication-quality figures, take a look at the kable function from knitr to write nice tables in Rmarkdown
filtered_data = crime_df %>% filter(Violations=="Total, all violations")
unique_x_values <- unique(crime_df$REF_DATE)
ggplot(filtered_data, aes(x = REF_DATE, y = `Rate per 100,000 population`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Rate per 100,000 population", title = "Rates of Total Crime by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "right") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
The plot depicts the total crime rate, encompassing all
types of crimes, from 1998 to 2021, with each province represented by a
distinct color. The crime rate is measured per 100,000 population.
Notably, Saskatchewan consistently exhibits a significantly higher crime
rate compared to other provinces throughout the period of 1998 to 2021.
Conversely, Quebec and Ontario consistently demonstrate the lowest crime
rates. Across all provinces, there is a discernible decreasing trend in
crime rates over the years, with many provinces experiencing peak crime
rates in 2003-2004.
In addition to analyzing total crime, I delved into specific crime categories often associated with poverty: break and enter, robbery, and prostitution. The three accompanying plots illustrate their respective rates over the years. Overall, there is a decreasing trend in the rates of all three crimes, with occasional exceptions such as robbery rates in Manitoba. Notably, British Columbia stands out with a significantly high rate of prostitution in 2004, doubling the number reported in Saskatchewan, which held the second-highest rate that year.
Note that the 3 plots below shares the same legend with the above plots. Therefore the legend is omited for better display purpose. Codes are also not shown as they reuses the above code chunk.
Below is a table summarizing the average crime rate of all crime
types (Total, all violations is excluded) by province and
year. The cells are colored by value, where lighter color means higher
value and darker color means smaller value. Scrolling down the table, we
can see that for all provinces, the average crime rate has a decreasing
trend as the colors are getting dark in each column. This agree with
what I found and infered from the previous plots.
| Year | Alberta | British Columbia | Manitoba | New Brunswick | Newfoundland and Labrador | Nova Scotia | Ontario | Prince Edward Island | Quebec | Saskatchewan |
|---|---|---|---|---|---|---|---|---|---|---|
| 1998 | 990.7389 | 1310.2795 | 1140.4897 | 746.1692 | 644.1653 | 875.0903 | 776.1408 | 686.4670 | 702.7424 | 1420.691 |
| 1999 | 996.9833 | 1243.8727 | 1139.4081 | 751.8170 | 616.3186 | 890.2262 | 705.8405 | 786.3608 | 649.5519 | 1372.294 |
| 2000 | 938.1432 | 1202.5849 | 1146.4238 | 719.6546 | 621.9546 | 807.6322 | 695.1511 | 739.1711 | 653.3770 | 1428.722 |
| 2001 | 979.9516 | 1233.4349 | 1213.3189 | 712.1708 | 617.2849 | 812.8562 | 672.0454 | 752.6111 | 634.2719 | 1518.370 |
| 2002 | 994.3392 | 1245.2322 | 1202.5438 | 726.1322 | 639.0897 | 816.2289 | 651.8003 | 850.6072 | 617.0995 | 1505.492 |
| 2003 | 1065.0392 | 1317.9670 | 1366.9247 | 758.0414 | 657.4530 | 893.6070 | 632.2778 | 915.8641 | 638.6003 | 1728.543 |
| 2004 | 1098.5319 | 1315.9251 | 1347.9319 | 780.0538 | 667.5708 | 915.0119 | 595.3678 | 893.0314 | 624.6361 | 1684.091 |
| 2005 | 1079.1975 | 1260.0459 | 1278.7128 | 692.0257 | 650.8968 | 859.9862 | 574.3292 | 812.8373 | 615.6142 | 1649.431 |
| 2006 | 1007.2778 | 1203.3584 | 1260.0861 | 648.6084 | 667.1994 | 856.8851 | 589.4051 | 737.7965 | 609.9108 | 1502.769 |
| 2007 | 1018.9281 | 1137.0500 | 1226.7060 | 608.8173 | 697.5675 | 803.5808 | 560.8689 | 670.0778 | 584.9914 | 1480.692 |
| 2008 | 984.9581 | 1046.3411 | 1081.2483 | 624.8186 | 696.9058 | 772.9158 | 537.7905 | 683.8757 | 589.6462 | 1412.581 |
| 2009 | 935.6630 | 983.8919 | 1153.1356 | 615.0719 | 716.9997 | 773.6461 | 521.8932 | 697.9243 | 578.2262 | 1403.483 |
| 2010 | 884.1551 | 933.4835 | 1043.1349 | 625.0964 | 725.7876 | 763.8995 | 496.5708 | 699.9311 | 562.8811 | 1432.178 |
| 2011 | 812.9859 | 880.2127 | 1012.0920 | 577.9611 | 705.3294 | 714.0051 | 468.0995 | 710.5243 | 519.5841 | 1408.674 |
| 2012 | 819.5778 | 856.2392 | 1000.7560 | 615.4689 | 696.8877 | 714.1444 | 449.8354 | 721.6681 | 510.7627 | 1311.701 |
| 2013 | 778.9997 | 803.0581 | 866.5694 | 538.5750 | 682.2809 | 639.5953 | 408.5759 | 669.5086 | 460.8854 | 1218.580 |
| 2014 | 783.3216 | 792.6345 | 840.0633 | 482.3608 | 636.2366 | 587.6563 | 391.6511 | 510.4745 | 421.5565 | 1155.519 |
| 2015 | 855.7324 | 827.4308 | 894.7058 | 527.3759 | 635.6389 | 551.9057 | 391.9432 | 451.0126 | 399.5150 | 1258.409 |
| 2016 | 869.6887 | 818.0165 | 956.7842 | 505.9946 | 630.6538 | 538.1619 | 399.0192 | 468.5732 | 405.6246 | 1317.321 |
| 2017 | 927.8811 | 781.4886 | 955.9997 | 549.3957 | 583.0170 | 549.4824 | 417.5035 | 439.5063 | 410.9189 | 1271.176 |
| 2018 | 886.8923 | 718.8339 | 912.0272 | 497.1263 | 502.9756 | 493.3066 | 400.3605 | 447.9967 | 339.5035 | 1188.911 |
| 2019 | 925.4612 | 796.9984 | 1016.7941 | 544.1014 | 553.7074 | 498.5129 | 404.5249 | 518.0077 | 326.3725 | 1151.130 |
| 2020 | 788.6307 | 735.6162 | 943.0979 | 564.7014 | 557.8670 | 491.2514 | 356.3917 | 446.2160 | 301.0057 | 1085.383 |
| 2021 | 730.3198 | 703.6836 | 839.5900 | 590.1656 | 602.4644 | 489.7486 | 360.7433 | 418.9484 | 312.5548 | 1106.370 |
Given the general decrease in crime rates, I am interested in exploring the trend of income to ascertain the potential existence of an association.
| Year | Alberta | British Columbia | Manitoba | New Brunswick | Newfoundland and Labrador | Nova Scotia | Ontario | Prince Edward Island | Quebec | Saskatchewan |
|---|---|---|---|---|---|---|---|---|---|---|
| 1998 | 44804.17 | 40441.67 | 38208.33 | 34125.00 | 30316.67 | 34320.83 | 45100.00 | 32633.33 | 37412.50 | 37370.83 |
| 1999 | 44237.50 | 41416.67 | 38075.00 | 35112.50 | 31783.33 | 36316.67 | 46620.83 | 33058.33 | 38520.83 | 38000.00 |
| 2000 | 45795.83 | 41462.50 | 38841.67 | 35654.17 | 32533.33 | 37070.83 | 48170.83 | 34225.00 | 39716.67 | 38579.17 |
| 2001 | 47845.83 | 41916.67 | 39650.00 | 36537.50 | 32600.00 | 37950.00 | 48279.17 | 34445.83 | 40520.83 | 40308.33 |
| 2002 | 46983.33 | 42695.83 | 39629.17 | 35870.83 | 33270.83 | 38487.50 | 48091.67 | 35020.83 | 40791.67 | 40150.00 |
| 2003 | 47920.83 | 41787.50 | 40129.17 | 36108.33 | 33125.00 | 37716.67 | 47812.50 | 35325.00 | 40570.83 | 40554.17 |
| 2004 | 49666.67 | 42941.67 | 41029.17 | 36433.33 | 33470.83 | 38187.50 | 48170.83 | 36070.83 | 41600.00 | 40345.83 |
| 2005 | 51758.33 | 43800.00 | 41700.00 | 36400.00 | 35012.50 | 39283.33 | 48395.83 | 36800.00 | 40804.17 | 42166.67 |
| 2006 | 53820.83 | 44679.17 | 42562.50 | 37479.17 | 36858.33 | 40408.33 | 47258.33 | 38000.00 | 41604.17 | 44904.17 |
| 2007 | 56879.17 | 45850.00 | 44633.33 | 39200.00 | 39304.17 | 41320.83 | 48079.17 | 38120.83 | 42441.67 | 47750.00 |
| 2008 | 57337.50 | 46858.33 | 45875.00 | 39520.83 | 40412.50 | 40933.33 | 49212.50 | 39500.00 | 41925.00 | 49100.00 |
| 2009 | 57237.50 | 45875.00 | 45437.50 | 40466.67 | 40500.00 | 42362.50 | 48404.17 | 39587.50 | 42545.83 | 50650.00 |
| 2010 | 57650.00 | 45400.00 | 45129.17 | 40470.83 | 42195.83 | 41579.17 | 48891.67 | 39841.67 | 42662.50 | 50337.50 |
| 2011 | 59145.83 | 45633.33 | 44687.50 | 41508.33 | 44045.83 | 42516.67 | 48133.33 | 41354.17 | 43833.33 | 52483.33 |
| 2012 | 62166.67 | 46400.00 | 45195.83 | 41458.33 | 46112.50 | 43358.33 | 48645.83 | 40562.50 | 44308.33 | 52687.50 |
| 2013 | 61962.50 | 47854.17 | 46837.50 | 42095.83 | 48583.33 | 45220.83 | 49600.00 | 42529.17 | 44716.67 | 53658.33 |
| 2014 | 63200.00 | 47929.17 | 47129.17 | 42545.83 | 49654.17 | 45504.17 | 50062.50 | 42658.33 | 45041.67 | 55975.00 |
| 2015 | 64020.83 | 47437.50 | 48054.17 | 42104.17 | 50033.33 | 45445.83 | 51083.33 | 43308.33 | 44354.17 | 54908.33 |
| 2016 | 57637.50 | 47354.17 | 47425.00 | 43383.33 | 48466.67 | 45908.33 | 51191.67 | 43366.67 | 45983.33 | 52816.67 |
| 2017 | 59825.00 | 50650.00 | 49529.17 | 44858.33 | 48633.33 | 45612.50 | 52683.33 | 44245.83 | 46412.50 | 53900.00 |
| 2018 | 59458.33 | 51012.50 | 48875.00 | 46133.33 | 49154.17 | 46508.33 | 52650.00 | 44962.50 | 47091.67 | 52412.50 |
| 2019 | 58787.50 | 51791.67 | 48016.67 | 45779.17 | 48695.83 | 46437.50 | 52162.50 | 44616.67 | 48854.17 | 51304.17 |
| 2020 | 57991.67 | 54508.33 | 50758.33 | 48325.00 | 50183.33 | 48558.33 | 55320.83 | 47558.33 | 51000.00 | 53500.00 |
| 2021 | 59270.83 | 55733.33 | 50300.00 | 48358.33 | 51308.33 | 49062.50 | 56441.67 | 47737.50 | 52195.83 | 53025.00 |
summary_stats <- joint_df %>%
group_by(GEO) %>%
summarise(Correlation = cor(`Average income (excluding zeros)`, `Rate per 100,000 population`))
summary_stats <- summary_stats %>% rename(Province = GEO)
summary_table <- summary_stats %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
add_header_above(c("Correlation between Average Total Income and Total Crime Rate" = 2)) %>%
kable_styling(fixed_thead = T)
summary_table
| Province | Correlation |
|---|---|
| Alberta | -0.7023852 |
| British Columbia | -0.8154907 |
| Manitoba | -0.7392303 |
| New Brunswick | -0.5966958 |
| Newfoundland and Labrador | -0.0025743 |
| Nova Scotia | -0.9236636 |
| Ontario | -0.7583329 |
| Prince Edward Island | -0.8082244 |
| Quebec | -0.9447648 |
| Saskatchewan | -0.7695228 |
Pick a province to look closer
What you found so far from your data in terms of the formulated question.